Mihai Ion
FIN 525

L20: Backtesting - data prep

# Import packages
import pandas as pd
pd.options.display.max_rows = 20

Lecture Overview

We will spend the next four lectures backtesting the asset-growth anomaly. Here are the overall steps:

  • Data
    • Load “comp_clean”, apply filters, and create (some of) the variables in Table 1
    • Do the same for “crsp_clean”
    • Merge the two datasets above
  • Table 1
    • Create AG deciles
    • Summarize selected firm characteristics at the AG-decile level
  • Create portfolios
    • Merge AG deciles with monthly returns data
    • Create a variable that tells us which stock is in which portfolio at each point in time
  • Table 2
    • EW portfolio returns
    • VW portfolio returns
    • Risk-adjusted EW portfolio returns
    • Risk-adjusted VW portfolio returns

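The "Create AG deciles" step above can be sketched with pd.qcut, which ranks firms into equal-sized bins by AG within each year. A minimal sketch on hypothetical data (the frame below is illustrative, not the actual Compustat sample):

```python
import pandas as pd

# Toy data: ten firms in one year, with asset growth (AG) already computed
comp = pd.DataFrame({
    'permno': range(1, 11),
    'year':   [1990] * 10,
    'AG':     [-0.3, -0.2, -0.1, -0.05, 0.0, 0.05, 0.1, 0.2, 0.4, 0.8],
})
# Assign each firm-year to an AG decile (1 = lowest growth, 10 = highest),
# ranking within each year separately
comp['AG_decile'] = comp.groupby('year')['AG'].transform(
    lambda x: pd.qcut(x, 10, labels=False) + 1
)
```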
Data

Accounting data (Compustat)

Data cleaning

# Load cleaned Compustat data
raw = pd.read_pickle('../data/compa.zip')
raw.shape
(278226, 23)
# Keep only what we need
comp = raw[['permno','datadate','at','revt','cogs','dltt','dlc','ib']].copy()
comp.dtypes
permno      float64
datadate     object
at          float64
revt        float64
cogs        float64
dltt        float64
dlc         float64
ib          float64
dtype: object
# Clean firmid and dates
comp['dtdate'] = pd.to_datetime(comp['datadate'])
comp['year'] = comp['dtdate'].dt.year
comp['permno'] = comp['permno'].astype('int64')
comp.dtypes
permno               int64
datadate            object
at                 float64
revt               float64
cogs               float64
dltt               float64
dlc                float64
ib                 float64
dtdate      datetime64[ns]
year                 int64
dtype: object
# Keep positive total assets, sort, and drop duplicates
comp = comp[comp['at']>0].copy()
comp = comp.sort_values(['permno','year'])
comp = comp.drop_duplicates(['permno','year'], keep='last', ignore_index=True)
comp.head()
permno datadate at revt cogs dltt dlc ib dtdate year
0 10000 1986-10-31 2.115 1.026 0.511 0.058 0.968 -0.730 1986-10-31 1986
1 10001 1986-06-30 12.242 21.460 19.565 2.946 0.343 0.669 1986-06-30 1986
2 10001 1987-06-30 11.771 16.621 15.538 2.750 0.377 0.312 1987-06-30 1987
3 10001 1988-06-30 11.735 16.978 15.556 2.555 0.325 0.542 1988-06-30 1988
4 10001 1989-06-30 18.565 22.910 19.856 7.370 0.185 1.208 1989-06-30 1989
comp.shape
(236316, 10)
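The `keep='last'` argument retains the most recent fiscal period when a firm reports twice in the same calendar year (e.g., after a fiscal-year-end change). A minimal illustration on hypothetical data:

```python
import pandas as pd

# Toy example: one firm with two fiscal periods ending in the same calendar year
df = pd.DataFrame({
    'permno': [1, 1],
    'year':   [1990, 1990],
    'at':     [5.0, 6.0],
})
df = df.sort_values(['permno', 'year'])
# keep='last' retains the later of the two observations for (permno, year)
out = df.drop_duplicates(['permno', 'year'], keep='last', ignore_index=True)
```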

New variables

# Calculate some of the variables in Table 1
# Asset growth (AG) = percentage change in total assets from the prior fiscal year
comp['at_lag1'] = comp.groupby('permno')['at'].shift(1)
comp['AG'] = comp['at'] / comp['at_lag1'] - 1
comp['L2AG'] = comp.groupby('permno')['AG'].shift(1)  # AG lagged one more year

comp['Leverage'] = (comp['dltt'] + comp['dlc']) / comp['at_lag1']
comp['ROA'] = comp['ib'] / comp['at']
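The `groupby` before `shift` matters: a plain `shift(1)` would leak the last observation of one firm into the first row of the next firm. A minimal illustration on hypothetical data:

```python
import pandas as pd

# Toy panel: two firms, two years each
df = pd.DataFrame({
    'permno': [1, 1, 2, 2],
    'at':     [10.0, 12.0, 100.0, 90.0],
})
# Correct: lag within each firm, so each firm's first year gets NaN
df['at_lag1'] = df.groupby('permno')['at'].shift(1)
# (A plain df['at'].shift(1) would instead give firm 2's first row
#  a "lag" of 12.0 -- firm 1's last value.)
df['AG'] = df['at'] / df['at_lag1'] - 1
```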
# Keep only the variables we need, and observations with non-missing AG
comp = comp.loc[comp['AG'].notnull(), :].copy()
comp_cgs = comp[['permno','year','dtdate','AG','L2AG', 'at','Leverage', 'ROA']].copy()
comp_cgs.head()
permno year dtdate AG L2AG at Leverage ROA
2 10001 1987 1987-06-30 -0.038474 NaN 11.771 0.255432 0.026506
3 10001 1988 1988-06-30 -0.003058 -0.038474 11.735 0.244669 0.046187
4 10001 1989 1989-06-30 0.582020 -0.003058 18.565 0.643801 0.065069
5 10001 1990 1990-06-30 0.017021 0.582020 18.881 0.396984 0.059901
6 10001 1991 1991-06-30 0.038028 0.017021 19.599 0.380012 0.054748
comp_cgs.shape
(212504, 8)
# Save this for later use
comp_cgs.to_pickle('../data/comp_cgs.zip')

Market data (CRSP)

Data Cleaning

# Load CRSP data
crsp = pd.read_pickle('../data/crspm.zip')
crsp.shape
(2553287, 10)
crsp.dtypes
permno    float64
permco    float64
date       object
prc       float64
ret       float64
shrout    float64
shrcd     float64
exchcd    float64
siccd     float64
ticker     object
dtype: object
# Keep only what we need
crsp = crsp[['permno','date','ret','prc','shrout','siccd']].copy().dropna()
crsp.shape
(2530665, 6)
# Get rid of financials (first digit of siccd = 6)
crsp = crsp.loc[crsp['siccd'].astype('string').str[0] != '6', :].copy()
crsp.shape
(2048105, 6)
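Since siccd is stored as a float, the string comparison above works on codes like "3990.0" because only the first character is checked. An equivalent numeric filter uses integer division; a sketch on hypothetical data:

```python
import pandas as pd

# Toy data: SIC division 6 (finance) corresponds to codes 6000-6999
df = pd.DataFrame({'permno': [1, 2, 3],
                   'siccd': [3990.0, 6020.0, 7372.0]})
# Keep firms whose SIC code does not start with 6
nonfin = df.loc[df['siccd'] // 1000 != 6, :]
```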
# Clean up firmid and dates
crsp['dtdate'] = pd.to_datetime(crsp['date'])
crsp['mdate'] = crsp['dtdate'].dt.to_period('M')
crsp['permno'] = crsp['permno'].astype('int64')

# Drop duplicates and sort
crsp = crsp.sort_values(['permno','mdate'])
crsp = crsp.drop_duplicates(['permno','mdate'], keep='last',ignore_index=True)
crsp.shape
(2048105, 8)
# Calculate market capitalization in $ millions (shrout is in thousands of shares).
# CRSP flags prices imputed from bid/ask averages with a negative sign, so take the absolute value.
crsp['MV'] = crsp['prc'].abs() * crsp['shrout'] / 1000
# Calculate lagged market cap
crsp.sort_values(['permno','mdate'], inplace=True)
crsp['mktcap_lag1'] = crsp.groupby('permno')['MV'].shift(1)
# Keep only the variables we need
crsp_cgs = crsp[['permno','mdate','dtdate','ret','MV','mktcap_lag1']].copy()
crsp_cgs.head(2)
permno mdate dtdate ret MV mktcap_lag1
0 10000 1986-02 1986-02-28 -0.257143 11.96 NaN
1 10000 1986-03 1986-03-31 0.365385 16.33 11.96
# Save for later use
crsp_cgs.to_pickle('../data/crsp_cgs.zip')

Calculate CRSP variables for Table 1

# Calculate cumulative (buy-and-hold) returns over the past 6 months (BHRET6)
crsp['BHRET6'] = 1
for i in range(0, 6):
    # multiply in the gross return from i months ago (i = 0 is the current month)
    crsp['BHRET6'] = crsp['BHRET6'] * (1 + crsp.groupby('permno')['ret'].shift(i))

crsp['BHRET6'] = crsp['BHRET6'] - 1
crsp['BHRET6'].describe()
count    1.957695e+06
mean     6.840459e-02
std      5.437771e-01
min     -9.999818e-01
25%     -1.941748e-01
50%      1.254532e-02
75%      2.265447e-01
max      6.691429e+01
Name: BHRET6, dtype: float64
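An alternative to the shift loop is to compound via sums of log gross returns, which pandas can roll over directly within each firm. A sketch on hypothetical data (the toy frame and the `bhret3` column name are illustrative; the data is already sorted by firm and month):

```python
import numpy as np
import pandas as pd

# Toy panel: two firms, four months of returns each
df = pd.DataFrame({
    'permno': [1, 1, 1, 1, 2, 2, 2, 2],
    'ret':    [0.10, -0.05, 0.02, 0.03, 0.00, 0.10, 0.10, -0.20],
})
# Buy-and-hold return over 3 months: prod(1+r) - 1 = expm1(rolling sum of log1p(r))
logret = np.log1p(df['ret'])
df['bhret3'] = np.expm1(
    logret.groupby(df['permno']).rolling(3).sum().reset_index(level=0, drop=True)
)
```

The first two months of each firm are NaN because a 3-month window is incomplete, matching the NaN pattern produced by the shift loop.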

Challenge:

Write a function called compound that calculates cumulative returns over a sequence of months and adds them as another column in the crsp dataframe (i.e., exactly what we did above with BHRET6, but for an arbitrary range of lags, not just 6).

def compound(dset=None, outvar=None,
             firmid='permno', datevar='mdate', retvar='ret',
             startlag=None, endlag=None):
    """Add a column `outvar` to `dset` with cumulative (buy-and-hold) returns
    compounded from lag `startlag` to lag `endlag` (inclusive).
    Negative lags compound over future months. Modifies `dset` in place."""

    dset.sort_values([firmid, datevar], inplace=True)  # side effect: reorders dset

    dset[outvar] = 1
    for i in range(startlag, endlag + 1):
        dset[outvar] = dset[outvar] * (1 + dset.groupby(firmid)[retvar].shift(i))

    dset[outvar] = dset[outvar] - 1
    return
compound(crsp,'mybhret6',startlag=0, endlag=5)
crsp.head(2)
permno date ret prc shrout siccd dtdate mdate MV mktcap_lag1 BHRET6 mybhret6
0 10000 1986-02-28 -0.257143 -3.2500 3680.0 3990.0 1986-02-28 1986-02 11.96 NaN NaN NaN
1 10000 1986-03-31 0.365385 -4.4375 3680.0 3990.0 1986-03-31 1986-03 16.33 11.96 NaN NaN
compound(crsp,'BHRET36',startlag=0, endlag=35)
crsp.head(2)
permno date ret prc shrout siccd dtdate mdate MV mktcap_lag1 BHRET6 mybhret6 BHRET36
0 10000 1986-02-28 -0.257143 -3.2500 3680.0 3990.0 1986-02-28 1986-02 11.96 NaN NaN NaN NaN
1 10000 1986-03-31 0.365385 -4.4375 3680.0 3990.0 1986-03-31 1986-03 16.33 11.96 NaN NaN NaN

Challenge:

Create a new variable called FBHRET12 which equals the (net) cumulative returns in the 12 months FOLLOWING the current month (t+1 to t+12 inclusive).

compound(crsp,'FBHRET12',startlag = -12, endlag = -1)
crsp.head(2)
permno date ret prc shrout siccd dtdate mdate MV mktcap_lag1 BHRET6 mybhret6 BHRET36 FBHRET12
0 10000 1986-02-28 -0.257143 -3.2500 3680.0 3990.0 1986-02-28 1986-02 11.96 NaN NaN NaN NaN -0.875000
1 10000 1986-03-31 0.365385 -4.4375 3680.0 3990.0 1986-03-31 1986-03 16.33 11.96 NaN NaN NaN -0.943662
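The negative lags work because shift with a negative period pulls future observations backward onto the current row. A toy check on hypothetical data:

```python
import pandas as pd

# Toy data: one firm, three months of returns
df = pd.DataFrame({'permno': [1, 1, 1], 'ret': [0.1, 0.2, 0.3]})
# shift(-1) aligns next month's return with the current row;
# the last month of each firm has no future observation, so it is NaN
df['fret1'] = df.groupby('permno')['ret'].shift(-1)
```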
# Keep only the variables we need for Table 1
crsp_table1 = crsp[['permno','mdate','MV','BHRET6','BHRET36','FBHRET12']].copy()
crsp_table1.head(7)
permno mdate MV BHRET6 BHRET36 FBHRET12
0 10000 1986-02 11.960000 NaN NaN -0.875000
1 10000 1986-03 16.330000 NaN NaN -0.943662
2 10000 1986-04 15.172000 NaN NaN -0.941406
3 10000 1986-05 11.793859 NaN NaN -0.929648
4 10000 1986-06 11.734594 NaN NaN NaN
5 10000 1986-07 10.786344 -0.350000 NaN NaN
6 10000 1986-08 4.148594 -0.663462 NaN NaN
# Save for later use
crsp_table1.to_pickle('../data/crsp_cgs_table1.zip')